This report explores a data set containing 4,898 white wines with 11 variables that quantify the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
Above we see the variables and their assigned data types in our data set.
We converted the variable quality into an ordered categorical (ordinal) variable, since we will not perform calculations on it but use it to rate the wines on a scale from 0 (very bad) to 10 (very excellent).
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
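A minimal sketch of this conversion in R, assuming the ratings arrive as integers. The `ratings` vector is hypothetical and only stands in for the real quality column; the levels 3 to 9 are the seven levels that appear in the listing above.

```r
# Hypothetical raw ratings; the real column holds 4,898 scores.
ratings <- c(6, 6, 5, 7, 5, 8, 6, 4, 3, 9)

# Ordered factor with the seven observed levels (3 < 4 < ... < 9).
quality <- factor(ratings, levels = 3:9, ordered = TRUE)

# Frequency table, analogous to the counts printed above.
quality_counts <- table(quality)
print(quality_counts)

# Ordered factors support order-aware comparisons against a level label.
n_above_average <- sum(quality >= "6")
print(n_above_average)
```

Keeping the factor ordered matters later on, because ordinal models and order-aware comparisons rely on the level ordering rather than on numeric arithmetic.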
Variable quality: We started by exploring the quality variable. Its distribution is roughly bell-shaped, with the largest share of wines (2,198 of 4,898) rated 6 (above average).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Variable alcohol: Alcohol content is right-skewed (the mean of 10.51 sits above the median of 10.40), with the most common value at 9.5 and a range from 8.00 to 14.20.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Variable sulphates: Sulphate content is approximately normally distributed, with the most common amount at 0.45 and a range from 0.22 to 1.08. The mean is 0.4898 and the median is 0.47, so half of the wines have a sulphate content of 0.47 or less.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Variable pH: The wine pH ranges from 2.72 to 3.82 and is approximately normally distributed, with the most common pH around 3.0. The median is 3.18 and the third quartile is 3.28, so a quarter of the wines fall between these two values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## 99%
## 1.000302
Variable density: On average the wine has a density of 0.994. Density is approximately normally distributed, with a median of 0.9937 and a third quartile of 0.9961.
We then zoomed into the bulk of the density distribution by omitting the top 1% of values (those above 1.000302) to get a better view of the plot. Most wines have a density of around 0.993.
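The trimming step can be sketched as follows. This is an illustration under assumptions: the first five density values are taken from the structure listing at the top of the report, while the last two are hypothetical stand-ins for the upper tail.

```r
# Five values from the str() listing plus two hypothetical tail values.
density <- c(1.001, 0.994, 0.995, 0.996, 0.996, 1.010, 1.039)

# 99th percentile, as in the "99%" output above.
cutoff <- quantile(density, probs = 0.99)

# Keep only the bulk of the distribution for plotting.
bulk <- density[density <= cutoff]

print(cutoff)
print(bulk)
```

Only the plotted range changes here; the summary statistics in the report are still computed on the full data.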
I think it would be useful to learn how the amount of alcohol per unit of density affects the critics' preference, so I will create a new variable:
alcohol.density = alcohol / density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.025 9.517 10.445 10.580 11.486 14.376
## 99%
## 13.5418
Variable alcohol.density: On average the wine has an alcohol.density value of 10.58. The distribution is slightly right-skewed (the mean of 10.58 sits above the median of 10.445), with a third quartile of 11.486 and a maximum of 14.376.
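A short sketch of how the derived variable could be built and summarised. The five rows use the first alcohol and density values from the structure listing at the top of the report; the data frame name `wine` is an assumption.

```r
# First five alcohol and density values from the str() listing.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

# Alcohol per unit of density.
wine$alcohol.density <- wine$alcohol / wine$density

print(summary(wine$alcohol.density))
print(quantile(wine$alcohol.density, probs = 0.99))

# Density barely varies, so the ratio tracks alcohol almost exactly.
print(cor(wine$alcohol, wine$alcohol.density))
```

Because density stays within a narrow band around 1, the new variable is close to a rescaled copy of alcohol, which is worth keeping in mind when both appear in the same analysis.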
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## 99%
## 241.03
Variable total.sulfur.dioxide: Total sulfur dioxide is approximately normally distributed, with a median of 134.0 and a third quartile of 167.0.
We then zoomed into the bulk of the total sulfur dioxide distribution by omitting the top 1% of its values to get a better view of the plot. Most wines have a total sulfur dioxide level of around 120.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## 99%
## 81
Variable free.sulfur.dioxide: Free sulfur dioxide is approximately normally distributed, with the middle 50% of wines (first to third quartile) ranging from 23.0 to 46.0.
We then zoomed into the bulk of the free sulfur dioxide distribution by omitting the top 1% of its values to get a better view of the plot. Most wines have a free sulfur dioxide level of around 32.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## 95%
## 0.067
Variable chlorides: Chloride levels have a median of 0.043 and a third quartile of 0.05. We plotted the bulk of the distribution by cutting the top 5% of values (those above 0.067); the remaining distribution is approximately normal, with a large number of wines at chloride levels of around 0.0475.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Variable residual.sugar: On average the residual sugar level is 6.391. The median is 5.2 and the third quartile is 9.9, with a maximum of 65.8.
We then omitted the top 1% of its values and saw that the distribution is right-skewed with a long tail. To get a better look at the distribution (without being distracted by its tail) we used a log transformation and observed a bimodal distribution, with maximum counts at around 1.5 and around 9.75.
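The log transformation can be sketched in base R as below. The `residual_sugar` values are hypothetical and chosen only to mimic a long right tail; the bin count is likewise an assumption.

```r
# Hypothetical residual sugar values (g/L) with a long right tail.
residual_sugar <- c(1.2, 1.5, 1.6, 1.7, 2.0, 5.2, 6.9, 8.5, 9.9, 12.0, 20.7, 65.8)

# A base-10 log compresses the tail and spreads out the low values.
log_sugar <- log10(residual_sugar)

# Bin the transformed values without drawing, to inspect the counts.
h <- hist(log_sugar, breaks = 10, plot = FALSE)

# Bin midpoints converted back to g/L, next to their counts.
print(data.frame(midpoint_g_per_L = round(10^h$mids, 2), count = h$counts))
```

Reading the bin midpoints back on the original scale is what allows peaks to be quoted in g/L rather than in log units.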
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
## 99%
## 0.74
Variable citric.acid: Some wines have a citric acid content of 0, and the maximum is 1.66. The median is 0.32 and the third quartile is 0.39. Citric acid is approximately normally distributed.
We then zoomed into the bulk of the citric acid distribution by omitting the top 1% of its values, and saw that most of the wines have a level of around 0.325.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## 99%
## 0.63
Variable volatile.acidity: Volatile acidity ranges from 0.08 to 1.1, with a median of 0.26 and a third quartile of 0.32. It is approximately normally distributed, with most wines recording levels of around 0.275.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Variable fixed.acidity: Fixed acidity ranges from 3.8 to 14.2, with a median of 6.8 and a third quartile of 7.3. To get a better view of the bulk of the distribution we cut off the top 1% of its values, and saw that most wines recorded a level of around 6.875. Fixed acidity is approximately normally distributed.
There are 4,898 white wines in the dataset with 11 chemical features (fixed and volatile acidity, free and total sulfur dioxide, citric acid, residual sugar, chloride and sulphate levels, pH, density and alcohol), plus the quality rating.
The quality rating is an ordered factor on a scale from 0 (very bad) to 10 (very excellent); only the levels 3 through 9 occur in the data.
Other observations:
PS: The variable X represents the index of the observations and was not used during the exploratory data analysis.
The main feature in the data set is the quality rating. I would like to determine which features are best for predicting the quality rating of white wine.
Alcohol content, acidity (fixed and volatile), residual sugar, total sulfur dioxide and pH are likely to contribute to the quality rating.
I created a new variable, alcohol.density = alcohol / density, since I think the alcohol content per unit of density could affect the rating awarded by a critic. I study how in the Bivariate Analysis section.
I found that residual sugar was right-skewed with a long tail, so I applied a log transformation to it. The transformed distribution was bimodal, with maximum counts at around 1.5 and around 9.75.
## fixed.acidity volatile.acidity total.sulfur.dioxide
## fixed.acidity 1.00 -0.02 0.09
## volatile.acidity -0.02 1.00 0.09
## total.sulfur.dioxide 0.09 0.09 1.00
## pH -0.43 -0.03 0.00
## alcohol -0.12 0.07 -0.45
## alcohol.density -0.13 0.07 -0.45
## residual.sugar.log 0.07 0.09 0.42
## pH alcohol alcohol.density residual.sugar.log
## fixed.acidity -0.43 -0.12 -0.13 0.07
## volatile.acidity -0.03 0.07 0.07 0.09
## total.sulfur.dioxide 0.00 -0.45 -0.45 0.42
## pH 1.00 0.12 0.12 -0.18
## alcohol 0.12 1.00 1.00 -0.39
## alcohol.density 0.12 1.00 1.00 -0.40
## residual.sugar.log -0.18 -0.39 -0.40 1.00
The correlation matrix covers only the variables currently being explored; quality is left out of it because it is not a numeric variable.
I noted that the following pairs are moderately correlated with each other: fixed acidity and pH (r = -0.43); residual sugar (log) and total sulfur dioxide (r = 0.42); residual sugar (log) and alcohol (r = -0.39); total sulfur dioxide and alcohol (r = -0.45).
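A correlation matrix of this kind could be produced roughly as follows. The ten rows are the first values shown in the structure listing at the top of the report, so the coefficients printed here will differ from the full-data values above; the data frame name `wine` is an assumption.

```r
# First ten rows of selected columns, copied from the str() listing.
wine <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Use the log-transformed sugar, as in the matrix above.
wine$residual.sugar.log <- log10(wine$residual.sugar)

vars <- c("fixed.acidity", "total.sulfur.dioxide", "pH",
          "alcohol", "residual.sugar.log")

# Pairwise Pearson correlations, rounded to two decimals.
corr <- round(cor(wine[, vars]), 2)
print(corr)
```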
Then we added quality to the plot matrix; what stood out most was its relationship with alcohol (r = 0.44), alcohol.density (r = 0.43), volatile acidity (r = -0.19) and total sulfur dioxide (r = -0.17).
Now I want to look more closely at plots involving alcohol, alcohol.density, volatile acidity, fixed acidity, residual sugar, total sulfur dioxide and pH.
We started by studying the relationship between quality and alcohol. Because of overplotting, we added some transparency to the points and jittered them, which adds a small amount of noise to the alcohol and quality data so that overlapping points become visible.
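The two overplotting remedies can be illustrated with a base R sketch. This is not the report's own plotting code (the `stat_bin()` messages suggest ggplot2 was used, where `geom_jitter()` with an `alpha` setting would play the same role); the data values are hypothetical, and only the quality axis is jittered here.

```r
set.seed(1)  # make the random jitter reproducible

# Hypothetical ratings and alcohol values with exact overlaps.
quality <- c(5, 5, 6, 6, 6, 7, 7, 8)
alcohol <- c(9.5, 9.5, 10.1, 10.1, 10.4, 11.4, 11.4, 12.3)

# Jitter: shift each rating by a uniform amount in [-0.3, 0.3].
quality_jittered <- jitter(quality, amount = 0.3)

# Transparency: semi-transparent points make dense regions look darker.
point_colour <- adjustcolor("steelblue", alpha.f = 0.3)

plot(quality_jittered, alcohol, pch = 16, col = point_colour,
     xlab = "Quality rating (jittered)", ylab = "Alcohol")
```

The jitter amount is kept below 0.5 so that points stay visually attached to their own rating.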
Overall it seems that critics are more likely to give a better rating when the wine has a higher alcohol content. This is similar to what is described by Waterhouse Lab (UC Davis), which notes that higher alcohols can have an aromatic effect in wine.
This does not mean, though, that alcohol content is the only feature that contributes to a better quality rating.
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5317 -0.5286 0.0012 0.4996 3.1579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.582009 0.098008 5.938 3.08e-09 ***
## alcohol 0.313469 0.009258 33.858 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
The model above shows that alcohol explains about 19% of the variance in the quality rating (R-squared = 0.1897), implying that other variables also contribute to the variation in quality ratings.
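For a single-predictor linear model, R-squared equals the squared Pearson correlation, which ties the 0.1897 above to the correlation between quality and alcohol reported earlier (the square root of 0.1897 is about 0.44). The sketch below checks that identity on simulated data; the variable names and simulation settings are assumptions. Note also that `as.numeric()` on an ordered factor returns the level codes (1 to 7 here) rather than the ratings 3 to 9, which shifts the intercept but leaves the slope and R-squared unchanged.

```r
set.seed(42)

# Simulated data loosely shaped like the fit above (hypothetical values).
alcohol <- runif(200, min = 8, max = 14.2)
quality_score <- 0.58 + 0.31 * alcohol + rnorm(200, sd = 0.8)

fit <- lm(quality_score ~ alcohol)

r_squared <- summary(fit)$r.squared
r <- cor(alcohol, quality_score)

# With one predictor, R-squared is the squared correlation.
print(c(r_squared = r_squared, r_squared_from_cor = r^2))

# Correlation implied by the R-squared reported above.
print(round(sqrt(0.1897), 2))
```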
## # A tibble: 7 x 2
## quality mean_alcohol.price
## <ord> <dbl>
## 1 3 10.4
## 2 4 10.2
## 3 5 9.86
## 4 6 10.6
## 5 7 11.5
## 6 8 11.7
## 7 9 12.3
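Similar to what we saw earlier, better ratings were generally given to wines with a higher alcohol content per unit of density: from a rating of 5 upward the mean rises steadily (9.86 at rating 5 to 12.3 at rating 9), although wines rated 3 and 4 average slightly higher than those rated 5.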
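A per-rating mean table like the one above could be computed along these lines. This base R sketch uses hypothetical values and `aggregate()`; the tibble output in the report suggests a dplyr pipeline was used originally.

```r
# Hypothetical ratings and alcohol.density values.
wine <- data.frame(
  quality = factor(c(5, 5, 6, 6, 7, 7), levels = 3:9, ordered = TRUE),
  alcohol.density = c(9.5, 10.1, 10.4, 10.8, 11.4, 11.6)
)

# Mean alcohol.density within each observed quality level.
means <- aggregate(alcohol.density ~ quality, data = wine, FUN = mean)
print(means)
```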
Then we studied the relationship between quality and volatile acidity. Because of overplotting, we again added some transparency and jittered the points, adding a small amount of noise to the volatile acidity and quality data.
Waterhouse Lab (UC Davis) describes volatile acidity as a marker of wine spoilage and undesirable aromas, so a lower concentration should improve the wine's quality.
In line with my research, we can see that at lower volatile acidity levels (around 0.15 to 0.35) critics were more likely to give a better rating of 6 (above average).
Then we studied the relationship between quality and pH. Because of overplotting, we added some transparency and jittered the points, adding a small amount of noise to the pH and quality data.
We see that wine with a mid-range pH (3.0 to 3.35, neither too low nor too high) was more likely to be given a good rating, that is, 5 (average) or 6 (above average).
Then we studied the relationship between quality and residual sugar. Because of overplotting, we added some transparency and jittered the points, adding a small amount of noise to the residual sugar and quality data. We also log-transformed residual sugar since, as we found in the Univariate Plots section, it does not have a normal distribution.
We can see that dry wine (residual sugar of 1 to 2 g/L) and semi-sweet wine (11 to 30 g/L) are more likely to be given a better rating than off-dry wine (5 to 10 g/L).
We see that wines with lower total sulfur dioxide levels are awarded a better quality rating.
Quality correlates moderately with alcohol content: the higher the alcohol content, the more likely the critic is to give a better rating. Based on the R-squared value from the linear model fit, however, alcohol explains only around 19% of the variance in quality, so other features of interest should be incorporated into the model to explain the variance in quality ratings.
Better ratings were given when the volatile acidity and Total Sulphur Dioxide levels were low
Wines with a moderate pH (neither too high nor too low) were more likely to get a better quality rating.
The critics preferred wine that was either dry or semi-sweet as opposed to off-dry wine
The wine's residual sugar content is moderately negatively correlated with the alcohol level (r = -0.39): the higher the alcohol level, the lower the sugar content (and vice versa). This is expected, as residual sugar is the sugar remaining after fermentation stops or is stopped.
Of the features examined, the quality rating is most strongly correlated with alcohol level (r = 0.44), a moderate positive correlation.
The scatter plots above show the trends that were noted in the Bivariate Plots and Analysis sections. Better quality ratings were awarded when there was more alcohol content, higher pH, lower fixed acidity, lower volatile acidity and lower total sulfur dioxide.
We decided to facet the relationship between fixed acidity and pH by quality rating to show more clearly that better ratings are given when pH and acidity are neither too high nor too low.
The above plots suggest that we can build a linear model and use the above variables in the linear model to predict the quality rating a critic gives
##
## Re-fitting to get Hessian
##
## Calls:
## m1: polr(formula = quality ~ alcohol.density, data = data)
## m2: polr(formula = quality ~ alcohol.density + volatile.acidity,
## data = data)
## m3: polr(formula = quality ~ alcohol.density + volatile.acidity +
## log10(residual.sugar), data = data)
## m4: polr(formula = quality ~ alcohol.density + volatile.acidity +
## log10(residual.sugar) + free.sulfur.dioxide, data = data)
## m5: polr(formula = quality ~ alcohol.density + volatile.acidity +
## log10(residual.sugar) + free.sulfur.dioxide + fixed.acidity,
## data = data)
## m6: polr(formula = quality ~ alcohol.density + volatile.acidity +
## log10(residual.sugar) + free.sulfur.dioxide + fixed.acidity +
## total.sulfur.dioxide, data = data)
## m7: polr(formula = quality ~ alcohol.density + volatile.acidity +
## log10(residual.sugar) + free.sulfur.dioxide + fixed.acidity +
## total.sulfur.dioxide + pH, data = data)
##
## ===========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## ---------------------------------------------------------------------------------------------------------------------------
## alcohol.density 0.762*** 0.808*** 0.945*** 0.970*** 0.959*** 0.941*** 0.937***
## (0.024) (0.025) (0.027) (0.028) (0.028) (0.029) (0.029)
## 3|4 2.194*** 1.029** 2.797*** 3.402*** 2.137*** 1.873*** 4.508***
## (0.327) (0.335) (0.367) (0.381) (0.461) (0.474) (0.891)
## 4|5 4.447*** 3.328*** 5.130*** 5.741*** 4.486*** 4.220*** 6.850***
## (0.250) (0.259) (0.300) (0.318) (0.408) (0.423) (0.864)
## 5|6 7.202*** 6.254*** 8.126*** 8.751*** 7.503*** 7.235*** 9.868***
## (0.248) (0.255) (0.301) (0.319) (0.408) (0.423) (0.865)
## 6|7 9.593*** 8.742*** 10.658*** 11.294*** 10.053*** 9.790*** 12.429***
## (0.269) (0.274) (0.320) (0.339) (0.422) (0.436) (0.874)
## 7|8 11.784*** 10.928*** 12.873*** 13.521*** 12.289*** 12.026*** 14.668***
## (0.287) (0.291) (0.337) (0.356) (0.434) (0.448) (0.881)
## 8|9 15.446*** 14.587*** 16.542*** 17.194*** 15.966*** 15.703*** 18.345***
## (0.527) (0.529) (0.556) (0.568) (0.620) (0.630) (0.986)
## volatile.acidity -5.093*** -5.695*** -5.517*** -5.555*** -5.410*** -5.345***
## (0.288) (0.294) (0.295) (0.295) (0.302) (0.303)
## log10(residual.sugar) 0.940*** 0.817*** 0.834*** 0.861*** 0.904***
## (0.077) (0.080) (0.080) (0.081) (0.082)
## free.sulfur.dioxide 0.011*** 0.010*** 0.013*** 0.014***
## (0.002) (0.002) (0.002) (0.002)
## fixed.acidity -0.162*** -0.153*** -0.099**
## (0.033) (0.033) (0.037)
## total.sulfur.dioxide -0.002* -0.003**
## (0.001) (0.001)
## pH 0.727***
## (0.208)
## ---------------------------------------------------------------------------------------------------------------------------
## Aldrich-Nelson R-sq. 0.186 0.227 0.245 0.250 0.252 0.253 0.254
## McFadden R-sq. 0.089 0.114 0.126 0.129 0.131 0.131 0.132
## Cox-Snell R-sq. 0.205 0.255 0.278 0.283 0.286 0.287 0.289
## Nagelkerke R-sq. 0.221 0.276 0.300 0.306 0.310 0.311 0.313
## Likelihood-ratio 1121.493 1441.589 1592.852 1629.259 1653.123 1658.516 1670.772
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5759.841 -5599.792 -5524.161 -5505.958 -5494.025 -5491.329 -5485.201
## Deviance 11519.682 11199.585 11048.322 11011.916 10988.051 10982.658 10970.402
## AIC 11533.682 11215.585 11066.322 11031.916 11010.051 11006.658 10996.402
## BIC 11579.158 11267.558 11124.791 11096.881 11081.513 11084.617 11080.858
## N 4898 4898 4898 4898 4898 4898 4898
## ===========================================================================================================================
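After adding all the variables under study, the model's Aldrich-Nelson pseudo R-squared rises to 0.254 (McFadden: 0.132; Nagelkerke: 0.313). These pseudo R-squared values measure improvement over an intercept-only model rather than a literal share of variance explained, so the 25.4% figure is best read as a rough indication of fit.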
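An ordinal logistic fit of this kind, together with two of the pseudo R-squared measures, can be sketched as below. The data are simulated and every name in the simulation is an assumption; `polr()` comes from the MASS package, and passing `Hess = TRUE` stores the Hessian at fit time, which avoids the "Re-fitting to get Hessian" message when the model is summarised. The two formulas reproduce the table's values when applied to its likelihood-ratio and log-likelihood rows (for m7, 1670.772 / (1670.772 + 4898) is about 0.254).

```r
library(MASS)  # provides polr(); MASS ships with standard R installations

set.seed(123)
n <- 400

# Simulated predictor and a latent score with logistic noise (hypothetical).
alcohol.density <- runif(n, min = 8, max = 14)
latent <- 0.8 * alcohol.density + rlogis(n)

# Cut the latent score into three ordered rating levels.
quality <- cut(latent, breaks = c(-Inf, 7.5, 9.5, Inf),
               labels = c("5", "6", "7"), ordered_result = TRUE)
sim <- data.frame(quality, alcohol.density)

# Proportional-odds model; Hess = TRUE keeps the Hessian for summary().
m1 <- polr(quality ~ alcohol.density, data = sim, Hess = TRUE)

# Log-likelihoods of the fitted and the intercept-only model.
ll_model <- as.numeric(logLik(m1))
counts <- as.numeric(table(sim$quality))
ll_null <- sum(counts * log(counts / n))

# Likelihood-ratio statistic and two pseudo R-squared measures.
lr <- 2 * (ll_model - ll_null)
mcfadden <- 1 - ll_model / ll_null
aldrich_nelson <- lr / (lr + n)

print(coef(m1))
print(c(McFadden = mcfadden, AldrichNelson = aldrich_nelson))
```

The intercept-only log-likelihood is computed directly from the class proportions, which is its closed form for a model with thresholds only.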
Wines with a higher amount of alcohol and lower total sulfur dioxide were more likely to get better quality ratings.
Better ratings were given when there were trace amounts of free sulfur dioxide and lower levels of volatile acidity.
From the Multivariate section I can now build an ordinal logistic model and use those variables to predict the wine critic's quality rating.
From my research I saw that pH and acidity also affect the quality of wine: with too little acidity the wine can be described as flat and unappealing, while with too much the wine is so tart that it would not be pleasing.
In line with this, most critics were more likely to give a good rating to wine that was neither too high nor too low on the pH and fixed acidity spectrum.
Yes. I created an ordinal logistic regression model, starting with quality as a function of alcohol.density alone, which gave an Aldrich-Nelson pseudo R-squared of about 0.19.
I later added the variables volatile.acidity, residual.sugar (log-transformed), free.sulfur.dioxide, fixed.acidity, total.sulfur.dioxide and pH; the full model reaches a pseudo R-squared of 0.254.
Above is the scatter plot of fixed acidity against pH, faceted by quality rating. It shows that most critics give better ratings when pH and acidity are neither too high nor too low.
The scatter plot of alcohol content against residual sugar shows that wines with higher alcohol content (and lower residual sugar) were more likely to receive a better quality rating.
This is the distribution of residual sugar, which was right-skewed with a long tail. We applied a log transformation to it and observed a bimodal distribution, with maximum counts at around 1.5 and around 9.75.
The White Wine data set contains information on 4,898 white wines with 11 variables that quantify the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
As expected, higher alcohol levels in the wine led the critics to give a better quality grading, given that higher alcohols give an aromatic effect in wines. Wines whose fixed acidity and pH were neither too high nor too low were also more likely to get better ratings. This was in line with what I found on the Napa Valley Register website: better wines have proper acid levels. Too much acidity and the wine would be so tart it wouldn't be pleasing; too little acidity and the wine becomes flat, dull and unappealing with food.
However, even after creating a model to describe quality based on the chemical features (alcohol/density, volatile and fixed acidity, residual sugar (log-transformed), free and total sulfur dioxide, and pH), I was only able to reach an Aldrich-Nelson pseudo R-squared of 0.254, so much of the variation in quality ratings remains unexplained. I think more features, more data, or both would be required to better understand and predict the quality ratings assigned to white wine.